
(Experimental) Add support to NTK RoPE scaling #118

Merged
merged 6 commits on Jul 1, 2023

Conversation

@Panchovix Panchovix (Contributor) commented Jun 29, 2023

This adds support for the new NTK RoPE scaling mentioned in #115.

According to this post, it is a method of RoPE scaling that results in less perplexity loss and allows a larger scaling factor:
https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

Adds the "alpha" parameter, which is set with "-a" when loading a model.

Tested on 65B models at 4K context, with 48 GB of VRAM (2x24 GB), using gs 16,20.


Perplexity:
For tulu-30B-GPTQ (non-SuperHOT):

  • Perplexity at 2048 ctx (no compress_pos_emb, no alpha RoPE): 5.2153
  • Perplexity at 8192 ctx, compress_pos_emb = 4: 10.0813
  • Perplexity at 8192 ctx, alpha = 4: 5.3534
  • Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 15.4406

For Tulu-30B-SuperHOT-8K-4bit-32g:

  • Perplexity at 8192 ctx, compress_pos_emb = 4: 5.8166
  • Perplexity at 8192 ctx, alpha = 4: 7.5073
  • Perplexity at 8192 ctx, compress_pos_emb = 4, alpha = 4: 6.0903

Note: for 8K context and above, I suggest sticking with SuperHOT.
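
For context, here is a minimal sketch of what static NTK ("alpha") scaling does to the rotary embeddings, contrasted with compress_pos_emb. This is illustrative only, not taken from this PR's diff; the helper name and default values are assumptions.

```python
import torch

def rope_sin_cos(head_dim, max_seq_len, alpha=1.0, compress_pos_emb=1.0, base=10000.0):
    # NTK-aware ("alpha") scaling: raise the rotary base so the low-frequency
    # dimensions stretch to cover positions beyond the original training length.
    base = base * alpha ** (head_dim / (head_dim - 2))

    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

    # Linear ("compress_pos_emb") scaling instead divides the positions themselves.
    t = torch.arange(max_seq_len, dtype=torch.float32) / compress_pos_emb

    freqs = torch.outer(t, inv_freq)
    return torch.sin(freqs), torch.cos(freqs)
```

Under this formula, alpha = 4 on 128-dimensional heads raises the base from 10000 to roughly 40,900 while leaving the positions themselves untouched.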

@Panchovix Panchovix mentioned this pull request Jun 29, 2023
@Panchovix Panchovix (Contributor, Author) commented Jun 30, 2023

Important update: previously, the alpha value wasn't being applied correctly. It is now, so setting alpha alone is enough for NTK RoPE scaling (there is no need to also set compress_pos_emb to the same value).

Also added perplexity results from a test of a 30B model.

@turboderp (Owner) commented:

I might refactor this a bit later, but it seems okay. I'll merge it as is for now.

@fahadh4ilyas commented:

Hi, I'm confused by the current code. It seems compress_pos_emb is still used alongside the alpha value, especially if we set compress_pos_emb to something other than 1 to scale the value of t. But dynamic scaling doesn't seem to do that. Is this intentional?

@Panchovix Panchovix (Contributor, Author) commented Jul 15, 2023

> Hi, I'm confused by the current code. It seems compress_pos_emb is still used alongside the alpha value, especially if we set compress_pos_emb to something other than 1 to scale the value of t. But dynamic scaling doesn't seem to do that. Is this intentional?

If compress_pos_emb is set to 1, the rotary embedding base is still set at 10000 (as if nothing had changed).

Ideally you want to set either compression or alpha, not both at the same time (for example, do not use compress_pos_emb = 2 and alpha = 2).

Also, this implementation of NTK is static RoPE scaling. Dynamic NTK scaling isn't implemented in exllama yet (it depends on the context length during generation).
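
For comparison, a rough sketch of what dynamic NTK scaling would do; this is not part of this PR or of exllama at the time of writing, and the function name, the 2048-token training length, and the head size are assumptions.

```python
def dynamic_ntk_base(seq_len, train_len=2048, head_dim=128, base=10000.0):
    # Below the original training length, plain RoPE needs no adjustment.
    if seq_len <= train_len:
        return base
    # Otherwise treat the current overshoot ratio as the alpha value, so the
    # rotary base grows with the live context length during generation.
    alpha = seq_len / train_len
    return base * alpha ** (head_dim / (head_dim - 2))
```

The static variant in this PR instead fixes alpha once at load time, which is why the sin/cos tables can be precomputed.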

@fahadh4ilyas commented:

Oh, thank you for the clarification. So the base value is static, based on alpha, correct? Then we could generate with a context longer than the default context size? And looking at the graph from the Reddit post, with alpha = 4 I could generate at a context size of 5000 without the perplexity exploding, just like the yellow line in the graph?

@Panchovix Panchovix (Contributor, Author) commented:

@fahadh4ilyas correct.
